Finite precision deep learning with theoretical guarantees
Recent successes of deep learning have been achieved at the expense of very high computational and parameter complexity. Today, deployment of both inference and training of deep neural networks (DNNs) is predominantly in the cloud. A recent alternative trend is to deploy DNNs onto untethered, resource-constrained platforms at the edge. To realize on-device intelligence, the gap between algorithmic requirements and available resources needs to be closed. One popular way of doing so is via implementation in finite precision.
While ad hoc trial-and-error techniques in finite precision deep learning abound, theoretical guarantees on network accuracy are elusive. The work presented in this dissertation builds a theoretical framework for the implementation of deep learning in finite precision. For inference, we theoretically analyze the worst-case accuracy drop in the presence of weight and activation quantization. Furthermore, we derive an optimal clipping criterion (OCC) to minimize the precision of dot-product outputs. For implementations using in-memory computing, OCC lowers ADC precision requirements. We analyze fixed-point training and present a methodology for implementing quantized back-propagation with close-to-minimal per-tensor precision. Finally, we study accumulator precision for reduced precision floating-point training using variance analysis techniques.
We first introduce our work on fixed-point inference with accuracy guarantees. Theoretical bounds on the mismatch between limited and full precision networks are derived. Proper precision assignments can be readily obtained using these bounds, and weight-activation as well as per-layer precision trade-offs are derived. Applied to a variety of networks and datasets, the presented analysis is found to be tight to within 2 bits. Furthermore, it is shown that a minimum precision network can have lower hardware complexity than a binarized network at iso-accuracy. In general, a minimum precision network can reduce complexity compared to a full precision baseline while maintaining accuracy. Per-layer precision analysis indicates that precision requirements of common networks vary from 2 bits to 10 bits to guarantee an accuracy close to the floating-point baseline.
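As a purely illustrative companion to this analysis, the sketch below sweeps a uniform per-tensor quantizer over a toy two-layer network and measures the empirical mismatch rate against the floating-point reference. The network, data, and bit-widths are made up for the example; the dissertation's contribution is the analytical bound that predicts this behavior without such simulation.

```python
# Hypothetical illustration: sweep per-tensor fixed-point precision and
# measure output mismatch against a floating-point baseline. This is the
# empirical quantity that analytical mismatch bounds aim to predict.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits, x_max):
    """Uniform fixed-point quantization to `bits` bits over [-x_max, x_max]."""
    step = 2 * x_max / (2 ** bits)
    return np.clip(np.round(x / step) * step, -x_max, x_max)

# A toy two-layer network with random weights (stand-in for a trained model).
W1, W2 = rng.standard_normal((64, 32)), rng.standard_normal((10, 64))
x = rng.standard_normal((1000, 32))

def forward(w1, w2, a_bits=None):
    h = np.maximum(w1 @ x.T, 0.0)          # ReLU hidden layer
    if a_bits is not None:                 # optional activation quantization
        h = quantize(h, a_bits, h.max())
    return (w2 @ h).argmax(axis=0)         # predicted labels

ref = forward(W1, W2)                      # floating-point reference labels
for b in range(2, 11):
    pred = forward(quantize(W1, b, np.abs(W1).max()),
                   quantize(W2, b, np.abs(W2).max()), a_bits=b)
    print(f"{b:2d} bits: mismatch rate {np.mean(pred != ref):.3f}")
```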
Then, we study DNN implementation using in-memory computing (IMC), where we propose OCC to minimize the column ADC precision. The signal-to-quantization-noise ratio (SQNR) of OCC is shown to be within 0.8 dB of the well-known optimal Lloyd-Max quantizer. OCC improves the SQNR of the commonly employed full range quantizer by 14 dB, which translates to a 3 bit reduction in ADC precision. We also study the input-serial weight-parallel (ISWP) IMC architecture, where bit-slicing techniques yield significant energy savings with minimal accuracy loss. Indeed, we prove that a dot-product can be realized with a single memory access while suffering no more than a 2 dB SQNR drop. Combining the proposed OCC and ISWP noise analysis with our DNN precision analysis, we demonstrate a reduction in energy consumption of DNN implementations at iso-accuracy.
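The following sketch illustrates numerically, not analytically, why clipping helps: for a Gaussian-distributed dot-product output, a brute-force search over clipping levels stands in for OCC, and its SQNR is compared against the full-range quantizer. The bit-width, distribution, and search grid are assumptions of this example.

```python
# Minimal numerical check (not the dissertation's derivation) of how a
# clipped quantizer compares to a full-range quantizer in SQNR.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # dot-product outputs ~ Gaussian (CLT)
bits = 4

def sqnr_db(x, clip):
    """SQNR of a signed uniform quantizer with clipping level `clip`."""
    step = 2 * clip / (2 ** bits)
    xq = np.clip(np.round(x / step) * step, -clip, clip)
    return 10 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))

full_range = np.abs(x).max()         # full-range quantizer: no clipping
clips = np.linspace(0.5, full_range, 200)
best = max(clips, key=lambda c: sqnr_db(x, c))   # brute-force stand-in for OCC
print(f"full-range SQNR: {sqnr_db(x, full_range):5.2f} dB")
print(f"clipped SQNR:    {sqnr_db(x, best):5.2f} dB at clip {best:.2f}")
```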
Furthermore, we study the quantization of the back-propagation training algorithm. We propose a systematic methodology to obtain close-to-minimal per-layer precision requirements that guarantee statistical similarity between fixed-point and floating-point training. The challenges of quantization noise, inter-layer and intra-layer precision trade-offs, dynamic range, and stability are jointly addressed. Applied to several benchmarks, fixed-point training is demonstrated to achieve high fidelity to the baseline, with an accuracy drop no greater than 0.56%. The derived precision assignment is shown to be within 1 bit per tensor of the minimum. The methodology is found to reduce the representational, computational, and communication costs of training compared to the baseline and related works.
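As a hedged illustration of what quantized training looks like in practice, the sketch below trains a toy logistic-regression model with per-tensor quantized weights and gradients. The bit-widths here are hand-picked for the example; the dissertation's methodology derives such assignments with statistical guarantees.

```python
# Sketch of fixed-point training on a toy problem: weights and gradients are
# quantized per tensor to chosen bit-widths. W_BITS and G_BITS are arbitrary
# illustrative choices, not values derived by the dissertation's method.
import numpy as np

rng = np.random.default_rng(0)

def q(x, bits):
    """Per-tensor uniform quantizer with dynamic range set by max |x|."""
    scale = np.abs(x).max() or 1.0
    step = 2 * scale / (2 ** bits)
    return np.clip(np.round(x / step) * step, -scale, scale)

X = rng.standard_normal((256, 16))
y = (X @ rng.standard_normal(16) > 0).astype(float)   # linearly separable
w = np.zeros(16)

W_BITS, G_BITS = 8, 12
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ q(w, W_BITS)))   # forward with quantized weights
    g = X.T @ (p - y) / len(y)                # logistic-loss gradient
    w -= 0.5 * q(g, G_BITS)                   # quantized gradient step

print("train accuracy:", np.mean((1 / (1 + np.exp(-X @ w)) > 0.5) == y))
```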
Finally, we address the problem of reduced precision floating-point training. In particular, we study accumulation precision requirements. We present the variance retention ratio (VRR), an analytical metric measuring the suitability of a given accumulation mantissa precision. The analysis expands on concepts employed in variance engineering for weight initialization. An analytical expression for the VRR is derived and used to determine accumulation bit-widths for precise tailoring of computation hardware. The VRR also quantifies the benefits of effective summation reduction techniques such as chunked accumulation and sparsification. Experimentally, the validity and tightness of our analysis are verified across multiple deep learning benchmarks.
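A small numerical experiment exposes the underlying phenomenon: summing many reduced-precision products in a single low-precision accumulator loses information (swamping), while chunked accumulation recovers most of it. This mirrors what the VRR quantifies but does not implement the VRR formula itself.

```python
# Why accumulation precision matters: a long fp16 running sum swamps small
# addends, while short fp16 chunk sums combined in fp32 stay accurate.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(2 ** 16).astype(np.float16)
b = rng.standard_normal(2 ** 16).astype(np.float16)
exact = np.dot(a.astype(np.float64), b.astype(np.float64))

# Naive fp16 accumulation: one long running sum.
s = np.float16(0)
for xi, wi in zip(a, b):
    s = np.float16(s + xi * wi)

# Chunked accumulation: fp16 within chunks, fp32 across chunks.
chunks = (a * b).reshape(-1, 64)
partial = [c.sum(dtype=np.float16) for c in chunks]
chunked = np.sum(partial, dtype=np.float32)

print(f"exact {exact:.2f} | naive fp16 error {abs(s - exact):.2f} "
      f"| chunked error {abs(chunked - exact):.2f}")
```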
Analytical guarantees for reduced precision fixed-point margin hyperplane classifiers
Margin hyperplane classifiers such as support vector machines are strong predictive models that have achieved considerable success in various classification tasks. Their conceptual simplicity makes them suitable candidates for the design of embedded machine learning systems.
Their accuracy and resource utilization can effectively be traded off against each other through precision. We analytically capture this trade-off by means of bounds on the precision requirements of general margin hyperplane classifiers. In addition, we propose a principled precision reduction scheme based on the trade-off between input and weight precisions.
Our analysis is supported by simulation results illustrating the gains of our approach in terms of reduced resource utilization. For instance, we show that a linear margin classifier with precision assignment dictated by our approach, applied to the 'two vs. four' task of the MNIST dataset, is ~2x more accurate than a standard 8-bit low-precision implementation, in spite of using ~2x10^4 fewer 1-bit full adders and ~2x10^3 fewer bits for data and weight representation.
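For intuition on the input/weight precision trade-off, the toy sweep below evaluates a synthetic linear classifier on a grid of input and weight bit-widths. Data, dimensions, and bit-widths are illustrative and unrelated to the paper's MNIST experiment or its analytical bounds.

```python
# Toy grid sweep over (input bits, weight bits) for a linear margin
# classifier, exposing the trade-off surface the paper analyzes.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Signed uniform quantizer with dynamic range set by max |x|."""
    scale = np.abs(x).max()
    step = 2 * scale / (2 ** bits)
    return np.clip(np.round(x / step) * step, -scale, scale)

X = rng.standard_normal((2000, 32))       # synthetic inputs
w = rng.standard_normal(32)               # stand-in for a trained hyperplane
y = np.sign(X @ w)                        # full-precision decisions

for bx in (2, 4, 8):
    row = []
    for bw in (2, 4, 8):
        pred = np.sign(quantize(X, bx) @ quantize(w, bw))
        row.append(f"w{bw}b={np.mean(pred == y):.3f}")
    print(f"input {bx} bits: " + " ".join(row))
```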
Analytical guarantees on numerical precision of deep neural networks
The acclaimed successes of neural networks often overshadow their tremendous complexity. We focus on numerical precision, a key parameter defining the complexity of neural networks. First, we present theoretical bounds on the accuracy in the presence of limited precision. Interestingly, these bounds can be computed via the back-propagation algorithm. Hence, by combining our theoretical analysis and the back-propagation algorithm, we are able to readily determine the minimum precision needed to preserve accuracy without having to resort to time-consuming fixed-point simulations. We provide numerical evidence showing how our approach allows us to maintain high accuracy but with lower complexity than state-of-the-art binary networks.
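A minimal sketch of the computational idea, assuming a standard framework with automatic differentiation: gradients of the output margin yield per-parameter sensitivities ("noise gains") that bounds of this kind combine with quantization step sizes. The bound's constants and exact form are omitted here; the network and data are placeholders.

```python
# Hedged sketch: use back-propagation to obtain per-tensor sensitivities of
# the classification margin, the quantities such accuracy bounds build on.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 10))
x = torch.randn(128, 32)

out = net(x)
top2 = out.topk(2, dim=1).values
margin = (top2[:, 0] - top2[:, 1]).sum()   # soft output margin
margin.backward()

# Squared-gradient sums act as noise gains: quantization noise with step
# ~2^-B on each tensor perturbs the margin in proportion to these values.
for name, p in net.named_parameters():
    print(f"{name:12s} noise gain {p.grad.pow(2).sum().item():.3e}")
```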
Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training
Data clipping is crucial in reducing noise in quantization operations and improving the achievable accuracy of quantization-aware training (QAT). Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm to determine MSE-optimal clipping scalars. Derived from the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the QAT routine. Thus, the QAT algorithm is formulated with provably minimum quantization noise at each step. In addition, we reveal limitations in common gradient estimation techniques in QAT and propose magnitude-aware differentiation as a remedy to further improve accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy on multiple tasks. These include training-from-scratch and retraining ResNets and MobileNets on ImageNet, and SQuAD fine-tuning using BERT models, where OCTAV-enabled QAT consistently preserves accuracy at low precision (4-to-6-bits). Our results require no modifications to the baseline training recipe, except for the insertion of quantization operations where appropriate. (Published as a spotlight paper at ICML 2022.)
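A minimal sketch of the recursion at the heart of OCTAV, as we read the MSE-optimality condition for a clipped B-bit uniform quantizer: s* = E[|x|·1{|x|>s}] / ((4^-B/3)·P(|x|<=s) + P(|x|>s)). The signed grid and the step^2/12 quantization-noise model are assumptions of this sketch rather than a restatement of the paper's exact formulation.

```python
# Fixed-point iteration of the Newton-Raphson-derived OCTAV update toward
# an MSE-optimal clipping scalar (as we read the optimality condition).
import numpy as np

def octav_clip(x, bits, iters=20):
    """Iterate toward the MSE-optimal clipping scalar for |x|."""
    ax = np.abs(x)
    s = ax.mean()                          # any positive initialization
    for _ in range(iters):
        over = ax > s                      # clipped samples at current s
        # Balance clipping noise (numerator) against quantization noise
        # on the in-range samples (the 4^-B/3 term in the denominator).
        s = ax[over].sum() / ((4.0 ** -bits / 3) * (~over).sum() + over.sum())
    return s

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)           # stand-in for a weight tensor
for b in (4, 6, 8):
    print(f"B={b}: optimal clip ~= {octav_clip(x, b):.3f}")
```

Because each update uses only counts and a masked sum, the recursion is cheap enough to run per tensor at every training iteration, which matches the abstract's "on the fly" claim.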